Funding period: 01 July 2014 - 31 March 2015


A.  Study  Goal 

The  goal  of  the  short-term  research  proposal  is  to  develop  and  test  a  computational  method  for  single-cell 
analysis  in  order  to  assess  the  feasibility  of  a  recently  submitted  research  proposal  entitled  "Single  cell  analysis 
of  dedifferentiation  and  transdifferentiation  in  mammalian  regeneration"  (hereafter  referred  to  as  the  Main 
Proposal)  in  collaboration  with  Prof.  Ken  Muneoka. 

B.  Studies  and  Results 

In  the  past  year,  we  have  continued  to  systematically  investigate  the  targeting  mechanisms  of  epigenetic 
factors  using  a  two-step  approach.  In  the  first  step,  we  investigated  the  cell-type  dependent  plasticity  of 
epigenetic  patterns  and  the  role  of  DNA  sequence  in  mediating  the  degree  of  plasticity.  In  the  second  step,  we 
further  incorporated  gene  expression  data  to  identify  transcription  factors  that  play  a  role  in  modulating  the 
epigenetic  patterns  in  a  cell-type  specific  manner.  To  this  end,  we  developed  and  experimentally  validated  a 
computational  method  to  predict  the  genome-wide  distribution  of  epigenetic  plasticity  and  cell-type  specific 
recruiting  factors  [1].  In  addition,  we  also  applied  computational  and  experimental  approaches  to  study  a 
number of other related issues, such as the stability of bivalent domains. Our work has resulted in five
manuscripts that have been published or are close to publication in peer-reviewed scientific journals [1-5]. In the
following,  I  will  describe  the  progress  we  have  made  during  this  funding  period. 

Aim 1: We will develop a computational approach to investigate the role of DNA sequences in the
regulation of genome-wide epigenetic patterns.

For model development, we chose H3K27me3 as a proof of concept because of its well-recognized role
in developmental control and the availability of a large amount of public data. We obtained H3K27me3 ChIP-seq
datasets for 19 human cell lines from the ENCODE consortium and quantified the plasticity of H3K27me3
occupancy  across  these  cell-types  at  each  genomic  locus.  We  then  focused  on  the  highly  plastic  regions 
(HPRs) because they are directly related to cell-type identity maintenance, as suggested by functional
enrichment and sequence conservation analyses. Consistent with the literature, we found that the majority of
HPRs  are  proximal  to  CpG  islands,  which  were  previously  shown  by  Brad  Bernstein  and  others  to  play  an 
important  role  in  recruitment  of  Polycomb  group  (PcG)  proteins.  On  the  other  hand,  44%  of  the  HPRs  are 
located  in  distal  regions,  where  the  recruitment  mechanism  is  less  clear. 
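
The plasticity quantification described above can be illustrated with a minimal sketch. The report does not state which dispersion measure or cutoff was used, so the population variance, the `top_fraction` cutoff, the input layout, and all function and locus names below are assumptions made only for illustration.

```python
from statistics import pvariance


def plasticity_scores(occupancy):
    """Return one dispersion score per locus.

    occupancy: dict mapping a locus id to a list of H3K27me3 signal values,
    one value per cell type (hypothetical input layout).
    Variance is an assumed stand-in for the plasticity measure.
    """
    return {locus: pvariance(values) for locus, values in occupancy.items()}


def highly_plastic(scores, top_fraction=0.25):
    """Return the loci whose score falls in the top `top_fraction` (assumed cutoff)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, round(len(ranked) * top_fraction))
    return set(ranked[:keep])


# Toy data: four loci measured in four cell types.
occupancy = {
    "locus_a": [0.0, 0.0, 0.0, 0.0],  # never marked
    "locus_b": [0.0, 5.0, 0.0, 5.0],  # marked in some cell types only
    "locus_c": [2.0, 2.1, 1.9, 2.0],  # uniformly marked
    "locus_d": [1.0, 4.0, 1.0, 1.0],  # marked strongly in one cell type
}
scores = plasticity_scores(occupancy)
hprs = highly_plastic(scores, top_fraction=0.25)
```

In this toy example, a locus that is marked in some cell types and unmarked in others receives the highest score, whereas loci that are uniformly marked or uniformly unmarked score near zero.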

To  investigate  the  role  of  DNA  sequence  in  mediating  the  degree  of  plasticity,  we  applied  our  N-score  model  to 
predict the location of the HPRs based on DNA sequence information. Our model provides substantial
predictive power (AUC = 0.82), suggesting that DNA sequence indeed plays a significant role. Of note, many of
the  proximal  HPRs  can  be  predicted  by  using  GC-  and  CpG-content  alone,  but  additional  sequence  features 
are  needed  for  accurate  prediction  of  distal  HPRs.  We  also  investigated  the  mechanisms  for  establishment  of 
cell-type  specific  patterns  at  the  distal  HPRs.  The  results  are  described  under  Aims  2  and  3. 
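
For readers unfamiliar with the AUC statistic quoted above, the following sketch shows how it can be computed from per-region prediction scores and HPR labels. It uses the rank-based (Mann-Whitney) definition and is not the N-score model itself; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0),
    counting ties as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: five regions, two of which are true HPRs (label 1).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 0.82 should be read.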

Recent studies suggest another important mechanism for PcG recruitment: interaction with
lincRNAs. Using RIP-chip experiments, John Rinn's lab identified hundreds of lincRNAs that may
interact with EZH2/PcG. To investigate the role of DNA sequence in mediating PcG-lincRNA interactions, we
adapted a BART model to predict the EZH2-bound lincRNAs and obtained excellent prediction accuracy (AUC
= 0.92). Our work also provides new insights into the role of RNA structure in mediating such interactions. A
manuscript  is  currently  in  preparation. 


Lastly, we were invited by Dr. Susana Vinga to contribute a review paper to a special issue of Briefings in
Bioinformatics [5]. In this paper, we surveyed recent developments in alignment-free methods and their
applications in epigenomics. Our survey also points to unexplored methods that could lead to
new discoveries in epigenomics research.

Aim  2:  We  will  develop  a  computational  approach  to  predict  tissue-specific  epigenetic  patterns  by 
integrating  DNA  sequence  information  together  with  gene  expression  data. 

Following  our  initial  analysis  of  the  H3K27me3  plasticity,  we  aimed  at  identifying  transcription  factors  that  play 
a  role  in  modulating  cell-type  specific  patterns.  To  this  end,  we  developed  a  computational  pipeline  that 
integrates  both  DNA  sequence  and  gene  expression  data  in  four  steps:  1)  identification  of  the  subset  of  HPRs 
that  are  specific  to  a  particular  cell-type;  2)  identification  of  enriched  TF  motifs  using  the  matched  genomic 
background; 3) refinement of the list of motifs by testing enrichment in center vs. flanking regions; and 4) mapping
of the motifs to TFs and removal of non-functional TFs by integrating gene expression data.
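
Steps 3 and 4 of the pipeline can be sketched as follows. The report does not specify the statistical test or the expression cutoff, so the one-sided binomial test (which assumes center and flanking windows of equal total length), the `alpha` and `min_expr` thresholds, and all motif and TF names below are hypothetical choices for illustration only.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def refine_motifs(hit_counts, alpha=0.05):
    """Step 3 (sketch): keep motifs whose hits concentrate in region centers.

    hit_counts: dict mapping motif -> (center_hits, flank_hits).
    Null hypothesis: a hit is equally likely to fall in the center or the flanks.
    """
    kept = []
    for motif, (center, flank) in hit_counts.items():
        total = center + flank
        if total and binomial_tail(center, total) < alpha:
            kept.append(motif)
    return kept


def expressed_tfs(motifs, motif_to_tf, expression, min_expr=1.0):
    """Step 4 (sketch): map motifs to TFs and drop TFs that are not expressed."""
    return sorted(
        {
            tf
            for motif in motifs
            for tf in motif_to_tf.get(motif, [])
            if expression.get(tf, 0.0) >= min_expr
        }
    )


# Toy inputs with hypothetical motif and TF names.
hit_counts = {"motif_1": (18, 2), "motif_2": (10, 10)}
motif_to_tf = {"motif_1": ["TF_A", "TF_B"], "motif_2": ["TF_C"]}
expression = {"TF_A": 12.0, "TF_B": 0.1, "TF_C": 30.0}

kept_motifs = refine_motifs(hit_counts)
candidates = expressed_tfs(kept_motifs, motif_to_tf, expression)
```

In this toy example, only the centrally enriched motif survives step 3, and of the two TFs that share it, only the one expressed in the cell type of interest is retained in step 4.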

In  total,  our  analysis  predicted  41  cell-type  specific  TF-HPR  associations  in  the  19  ENCODE  cell  lines.  These 
predictions were validated by using independent data available in the literature, including ChIP-seq data,
shRNA data, and functional analysis. We found significant support for most of our predictions. As a
more  stringent  validation,  we  conducted  experiments  to  directly  test  the  function  of  predicted  TFs  in  a  model 
system,  as  described  below. 

Aim  3:  We  will  experimentally  validate  the  computational  predictions. 

We used primary human erythroid progenitors (ProEs) as a model system to validate our predictions. In
previous work [6], we generated genome-wide profiles of transcription, histone modification, and transcription
factors in adult and fetal ProEs to investigate the mechanism underlying developmental stage-specific gene
activities. We combined the H3K27me3 data from ProEs and from the ENCODE cell lines as input to our
computational pipeline.

Surprisingly, our model predicted TAL1, which is a principal regulator of hematopoietic development and
commonly known as a transcriptional activator, to play a major role in mediating ProE-specific H3K27me3
patterns, suggesting a previously unrecognized function of TAL1. To identify the cofactors associated with this
repressive role, we searched for TF motifs that are differentially enriched between the TAL1+H3K27me3 and
TAL1+H3K27ac regions, and predicted GFI1B as a leading candidate. To test these predictions, we conducted
ChIP-seq and co-IP experiments. Our ChIP-seq data confirm that TAL1 colocalizes with
H3K27me3 at the ProE-specific HPRs, and our co-IP data show that EZH2/PRC2 can pull down both TAL1
and GFI1B, and that GFI1B can pull down EZH2. These data strongly validate our computational approach and
highlight the utility of chromatin plasticity analysis in uncovering novel mechanisms.

We developed an allelic-imbalance approach for studying the molecular functions of GWAS variants [4]. As a
proof of concept, we focused on a well-characterized enhancer of BCL11A, which was previously shown to
repress HbF in adult erythroid cells. The region contains a number of genetic variants associated with the HbF
level. To test the function of these variants, we adapted a TALEN-based assay to delete the target DNA
sequence and identified a single variant that causally reduced GATA1 binding and increased HbF expression.
This  approach  will  be  used  to  experimentally  validate  our  computational  predictions  in  the  future. 

C.  Significance 

We  have  developed  a  powerful,  generally  applicable  approach  for  investigating  the  mechanism  underlying  the 
plasticity  of  epigenetic  patterns.  Our  approach  systematically  investigates  the  role  of  DNA  sequence  in 
modulating plasticity, and provides a useful tool to identify key regulators modulating cell-type specific
epigenetic patterns. Our approach has led to new insights into the multi-faceted role of master regulators,
such as TAL1, in the maintenance of cell identity.

D.  Plans 

We plan to extend our previous work by applying our model to other epigenetic marks. In this direction, we
have initiated a collaboration with Dr. Andrew Feinberg's group, which pioneered cancer epigenetics and
recently discovered extensive DNA methylation variability among cancer patients. Using our
computational pipeline, we are working to identify the regulators underlying DNA methylation variability. We
will also extend our model by incorporating multiple epigenetic marks and investigating the plasticity of
the combinatorial states. We are currently funded by NHGRI (R21HG006778) to develop a statistical model
that characterizes the hierarchical chromatin structure by integrating multiple chromatin marks and will
generate genome-wide maps of hierarchical, combinatorial chromatin states in ENCODE cell lines. These
maps will serve as the basis for studying combinatorial state plasticity. The predictions will be experimentally
validated by using ChIP-seq, shRNA, and genome editing methods, as we have done before. Lastly, we have
started to package our algorithm as open-source, user-friendly software to make it accessible to the broader
community.
The SCUBA software generated by the studies described above has been deposited on GitHub. URL:
https://github.com/gcyuan/SCUBA


